Face Identity Disentanglement via Latent Space Mapping
Learning disentangled representations of data is a fundamental problem in
artificial intelligence. Specifically, disentangled latent representations
allow generative models to control and compose the disentangled factors in the
synthesis process. Current methods, however, require extensive supervision and
training, or instead, noticeably compromise quality. In this paper, we present
a method that learns how to represent data in a disentangled way, with minimal
supervision, manifested solely using available pre-trained networks. Our key
insight is to decouple the processes of disentanglement and synthesis, by
employing a leading pre-trained unconditional image generator, such as
StyleGAN. By learning to map into its latent space, we leverage both its
state-of-the-art generative quality and its rich and expressive latent
space, without the burden of training it. We demonstrate our approach on the
complex and high-dimensional domain of human heads. We evaluate our method
qualitatively and quantitatively, and exhibit its success with
de-identification operations and with temporal identity coherency in image
sequences. Through this extensive experimentation, we show that our method
successfully disentangles identity from other facial attributes, surpassing
existing methods, even though they require more training and supervision.
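As a rough illustration of the decoupling described above, the following sketch (our own, not the authors' code) encodes identity from one image and attributes from another and maps the concatenated codes into the latent space of a frozen, pre-trained generator. The encoder architectures, code dimensions, and the load_pretrained_stylegan helper are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class LatentMapper(nn.Module):
    def __init__(self, id_dim=512, attr_dim=512, w_dim=512):
        super().__init__()
        # Separate encoders for identity and for the remaining facial attributes.
        self.id_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(id_dim), nn.ReLU())
        self.attr_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(attr_dim), nn.ReLU())
        # Small MLP mapping the concatenated codes into the generator's latent (W) space.
        self.to_w = nn.Sequential(
            nn.Linear(id_dim + attr_dim, w_dim), nn.ReLU(), nn.Linear(w_dim, w_dim)
        )

    def forward(self, id_image, attr_image):
        z_id = self.id_encoder(id_image)        # identity taken from one image
        z_attr = self.attr_encoder(attr_image)  # pose/expression/etc. from another
        return self.to_w(torch.cat([z_id, z_attr], dim=1))

# Usage sketch: the frozen, pre-trained generator renders the mapped code.
# generator = load_pretrained_stylegan()        # hypothetical loader, kept frozen
# w = LatentMapper()(id_img, attr_img)
# out = generator(w)                            # identity of id_img, attributes of attr_img
```

Only the mapper is trained; the generator stays fixed, which is the point of decoupling disentanglement from synthesis.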
Render for CNN: Viewpoint Estimation in Images Using CNNs Trained with Rendered 3D Model Views
Object viewpoint estimation from 2D images is an essential task in computer
vision. However, two issues hinder its progress: scarcity of training data with
viewpoint annotations, and a lack of powerful features. Inspired by the growing
availability of 3D models, we propose a framework to address both issues by
combining render-based image synthesis and CNNs. We believe that 3D models have
the potential to generate a large number of images with high variation, which
can be well exploited by deep CNNs with their high learning capacity. Towards this
goal, we propose a scalable and overfit-resistant image synthesis pipeline,
together with a novel CNN specifically tailored for the viewpoint estimation
task. Experimentally, we show that viewpoint estimation with our pipeline
significantly outperforms state-of-the-art methods on the PASCAL 3D+ benchmark.
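The following sketch illustrates the general render-and-train recipe under stated assumptions: render_view is a hypothetical stand-in for an off-the-shelf renderer, and the one-bin-per-degree azimuth head is only one common way to cast viewpoint estimation as classification, not necessarily the paper's network.

```python
import random
import torch
import torch.nn as nn
from torchvision.models import resnet18

N_AZIMUTH_BINS = 360   # one bin per degree of azimuth, purely for illustration

class ViewpointNet(nn.Module):
    """A standard 2D CNN backbone with a viewpoint-classification head."""
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, N_AZIMUTH_BINS)
        self.net = backbone

    def forward(self, x):
        return self.net(x)   # logits over discretized azimuth bins

def synthesize_batch(models, batch_size=32):
    """Sample a 3D model and a random viewpoint, render it, and keep the azimuth
    as the label. `render_view` is a placeholder for an off-the-shelf renderer."""
    images, labels = [], []
    for _ in range(batch_size):
        model = random.choice(models)
        azimuth = random.uniform(0.0, 360.0)
        elevation = random.uniform(-10.0, 45.0)
        images.append(render_view(model, azimuth, elevation))  # hypothetical renderer
        labels.append(int(azimuth) % N_AZIMUTH_BINS)
    return torch.stack(images), torch.tensor(labels)
```

The synthetic batches come with exact viewpoint labels for free, which is what makes the rendered data attractive when annotated real images are scarce.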
Bundle Optimization for Multi-aspect Embedding
Understanding semantic similarity among images is the core of a wide range of
computer vision applications. An important step towards this goal is to collect
and learn human perceptions. Interestingly, the semantic context of images is
often ambiguous as images can be perceived with emphasis on different aspects,
which may be contradictory to each other.
In this paper, we present a method for learning the semantic similarity among
images, inferring their latent aspects and embedding them into multi-spaces
corresponding to their semantic aspects.
We formulate the multi-embedding problem as an optimization that
evaluates the embedded distances with respect to the qualitative clustering
queries. The key idea of our approach is to collect and embed qualitative
measures that share the same aspects in bundles. To ensure similarity aspect
sharing among multiple measures, image classification queries are presented to,
and solved by, users. The collected image clusters are then converted into
bundles of tuples, which are fed into our bundle optimization algorithm that
jointly infers the aspect similarity and multi-aspect embedding. Extensive
experimental results show that our approach significantly outperforms
state-of-the-art multi-embedding approaches on various datasets, and scales
well to large multi-aspect similarity measures.
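A minimal sketch of how bundles of qualitative (triplet-style) constraints could drive a multi-space embedding; the soft-min aspect assignment, dimensions, and margin are our assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

n_items, n_aspects, dim, margin = 100, 3, 16, 1.0
# One embedding table per semantic aspect, all optimized jointly.
embeddings = torch.nn.Parameter(torch.randn(n_aspects, n_items, dim))

def bundle_loss(bundles):
    """bundles: list of (T, 3) long tensors of (anchor, positive, negative) item
    indices; the triplets in one bundle were collected under one shared aspect."""
    total = 0.0
    for triplets in bundles:
        a, p, n = triplets[:, 0], triplets[:, 1], triplets[:, 2]
        # Triplet violation of the whole bundle, measured in every aspect space.
        d_ap = ((embeddings[:, a] - embeddings[:, p]) ** 2).sum(-1)
        d_an = ((embeddings[:, a] - embeddings[:, n]) ** 2).sum(-1)
        per_aspect = F.relu(margin + d_ap - d_an).mean(dim=1)   # shape (n_aspects,)
        # Soft-min over aspects: the bundle is explained by its best-fitting space.
        weights = F.softmin(per_aspect, dim=0)
        total = total + (weights * per_aspect).sum()
    return total / len(bundles)
```

Because the soft assignment is differentiable, aspect inference and the multi-aspect embedding can be optimized jointly, mirroring the joint inference the abstract describes.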
FPNN: Field Probing Neural Networks for 3D Data
Building discriminative representations for 3D data has been an important
task in computer graphics and computer vision research. Convolutional Neural
Networks (CNNs) have been shown to operate on 2D images with great success for a
variety of tasks. Lifting convolution operators to 3D (3D CNNs) seems like a
plausible and promising next step. Unfortunately, the computational complexity
of 3D CNNs grows cubically with respect to voxel resolution. Moreover, since
most 3D geometry representations are boundary based, occupied regions do not
increase proportionately with the size of the discretization, resulting in
wasted computation. In this work, we represent 3D spaces as volumetric fields,
and propose a novel design that employs field probing filters to efficiently
extract features from them. Each field probing filter is a set of probing
points --- sensors that perceive the space. Our learning algorithm optimizes
not only the weights associated with the probing points, but also their
locations, which deforms the shape of the probing filters and adaptively
distributes them in 3D space. The optimized probing points sense the 3D space
"intelligently", rather than operating blindly over the entire domain. We show
that field probing is significantly more efficient than 3D CNNs, while providing
state-of-the-art performance, on classification tasks for 3D object recognition
benchmark datasets.
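To illustrate the probing idea, here is a minimal field probing layer in which the probing point locations are learnable parameters; the trilinear sampling via grid_sample and the per-point weighting are our simplifications of the description above, not the FPNN implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FieldProbingLayer(nn.Module):
    def __init__(self, n_filters=16, points_per_filter=8):
        super().__init__()
        # Probing point locations in normalized [-1, 1]^3 coordinates; optimizing
        # them deforms the filters and redistributes them over the 3D space.
        self.locations = nn.Parameter(torch.rand(n_filters, points_per_filter, 3) * 2 - 1)
        self.weights = nn.Parameter(torch.randn(n_filters, points_per_filter))

    def forward(self, field):
        # field: (B, 1, D, H, W) volumetric field, e.g. a distance field of the shape.
        b = field.shape[0]
        grid = self.locations.view(1, 1, 1, -1, 3).expand(b, -1, -1, -1, -1)
        samples = F.grid_sample(field, grid, align_corners=True)  # trilinear sampling
        samples = samples.view(b, *self.locations.shape[:2])      # (B, filters, points)
        return (samples * self.weights).sum(dim=-1)               # (B, filters) responses
```

The cost scales with the number of probing points rather than with the full voxel grid, which is where the efficiency advantage over dense 3D convolution comes from.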
DiDA: Disentangled Synthesis for Domain Adaptation
Unsupervised domain adaptation aims at learning a shared model for two
related, but not identical, domains by transferring supervision from a labeled source
domain to an unlabeled target domain. A number of effective domain
adaptation approaches rely on the ability to extract discriminative, yet
domain-invariant, latent factors which are common to both domains. Extracting
latent commonality is also useful for disentanglement analysis, enabling
separation between the common and the domain-specific features of both domains.
In this paper, we present a method for boosting domain adaptation performance
by leveraging disentanglement analysis. The key idea is that by learning to
separately extract both the common and the domain-specific features, one can
synthesize more target domain data with supervision, thereby boosting the
domain adaptation performance. Better common feature extraction, in turn, helps
further improve the disentanglement analysis and disentangled synthesis. We
show that iterating between domain adaptation and disentanglement analysis
consistently improves both, on several unsupervised domain adaptation
tasks and for various domain adaptation backbone models.
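A high-level sketch of the alternation described above; train_adaptation, train_disentangler, and synthesize_labeled_target are hypothetical placeholders for the three stages, so this is an interpretation of the loop rather than the authors' training code.

```python
def dida_loop(source_data, target_data, n_rounds=3):
    model = train_adaptation(source_data, target_data)            # initial DA model
    for _ in range(n_rounds):
        # 1. Split features into common (domain-invariant) and domain-specific parts.
        disentangler = train_disentangler(model, source_data, target_data)
        # 2. Recombine source labels / common features with target-specific features
        #    to synthesize additional labeled target-domain samples.
        synthetic = synthesize_labeled_target(disentangler, source_data, target_data)
        # 3. Retrain domain adaptation with the enlarged, supervised training set.
        model = train_adaptation(source_data + synthetic, target_data)
    return model
```

Each round gives the disentangler a better-adapted feature extractor and the adaptation model more supervised target data, which is the mutual improvement the abstract claims.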
Synthesizing Training Images for Boosting Human 3D Pose Estimation
Human 3D pose estimation from a single image is a challenging task with
numerous applications. Convolutional Neural Networks (CNNs) have recently
achieved superior performance on the task of 2D pose estimation from a single
image, by training on images with 2D annotations collected by crowdsourcing.
This suggests that similar success could be achieved for direct estimation of
3D poses. However, 3D poses are much harder to annotate, and the lack of
suitable annotated training images hinders attempts towards end-to-end
solutions. To address this issue, we opt to automatically synthesize training
images with ground truth pose annotations. Our work is a systematic study along
this road. We find that pose space coverage and texture diversity are the key
ingredients for the effectiveness of synthetic training data. We present a
fully automatic, scalable approach that samples the human pose space for
guiding the synthesis procedure and extracts clothing textures from real
images. Furthermore, we explore domain adaptation for bridging the gap between
our synthetic training images and real testing photos. We demonstrate that CNNs
trained with our synthetic images outperform those trained with real photos on
3D pose estimation tasks.
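As an illustration of pose-space sampling, one ingredient of the pipeline above, the sketch below draws new poses by interpolating and jittering MoCap poses; the per-joint-angle representation and noise level are assumptions, and linear interpolation of joint angles is only a crude stand-in for principled pose-space coverage.

```python
import numpy as np

def sample_poses(mocap_poses, n_samples, noise_deg=5.0, rng=None):
    """mocap_poses: (N, J, 3) per-joint angles (degrees) for N poses and J joints."""
    rng = rng or np.random.default_rng()
    out = []
    for _ in range(n_samples):
        i, j = rng.integers(0, len(mocap_poses), size=2)
        t = rng.uniform(0.0, 1.0)
        # Interpolate two real poses (a crude stand-in for principled pose-space
        # coverage), then jitter every joint angle slightly.
        pose = (1 - t) * mocap_poses[i] + t * mocap_poses[j]
        out.append(pose + rng.normal(0.0, noise_deg, size=pose.shape))
    return np.stack(out)
```

Each sampled pose would then be rendered with a clothing texture extracted from real images, yielding an image with an exact 3D pose label.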
PointCNN: Convolution On X-Transformed Points
We present a simple and general framework for feature learning from point
clouds. The key to the success of CNNs is the convolution operator that is
capable of leveraging spatially-local correlation in data represented densely
in grids (e.g. images). However, point clouds are irregular and unordered, so
directly convolving kernels against the features associated with the points would
discard shape information and be variant to point ordering. To
address these problems, we propose to learn an X-transformation
from the input points, which simultaneously serves two purposes. The first is the
weighting of the input features associated with the points, and the second is
the permutation of the points into a latent and potentially canonical order.
Element-wise product and sum operations of the typical convolution operator are
subsequently applied on the X-transformed features. The proposed
method is a generalization of typical CNNs to feature learning from point
clouds, thus we call it PointCNN. Experiments show that PointCNN achieves on
par or better performance than state-of-the-art methods on multiple challenging
benchmark datasets and tasks. To be published in NIPS 2018; code is available at
https://github.com/yangyanli/PointCN
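A simplified X-Conv-style operator consistent with the description above (not the released PointCNN code): a small MLP predicts a K x K matrix from the K neighbor coordinates, this matrix weights and permutes the neighbor features, and a dense layer then aggregates them like a convolution would.

```python
import torch
import torch.nn as nn

class SimpleXConv(nn.Module):
    def __init__(self, k, in_ch, out_ch):
        super().__init__()
        self.k = k
        # MLP predicting the K x K X-transformation from the K neighbor coordinates.
        self.x_mlp = nn.Sequential(
            nn.Linear(k * 3, k * k), nn.ReLU(), nn.Linear(k * k, k * k)
        )
        self.conv = nn.Linear(k * in_ch, out_ch)   # stand-in for the final convolution

    def forward(self, neighbor_xyz, neighbor_feat):
        # neighbor_xyz:  (B, K, 3) coordinates relative to the representative point.
        # neighbor_feat: (B, K, C) features of the K nearest neighbors.
        b = neighbor_xyz.shape[0]
        x_mat = self.x_mlp(neighbor_xyz.reshape(b, -1)).view(b, self.k, self.k)
        transformed = torch.bmm(x_mat, neighbor_feat)   # weight and permute features
        return self.conv(transformed.reshape(b, -1))    # aggregate into output features
```

Because the transformation is predicted from the coordinates themselves, the operator can compensate for the arbitrary ordering of the input points.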
MixTConv: Mixed Temporal Convolutional Kernels for Efficient Action Recognition
To efficiently extract spatiotemporal features of video for action
recognition, most state-of-the-art methods integrate 1D temporal convolution
into a conventional 2D CNN backbone. However, they all exploit 1D temporal
convolutions of a fixed kernel size (i.e., 3) in the network building block, and thus
have suboptimal temporal modeling capability for handling both long-term and
short-term actions. To address this problem, we first investigate the impacts
of different kernel sizes for the 1D temporal convolutional filters. Then, we
propose a simple yet efficient operation called Mixed Temporal Convolution
(MixTConv), which consists of multiple depthwise 1D convolutional filters with
different kernel sizes. By plugging MixTConv into the conventional 2D CNN
backbone ResNet-50, we further propose an efficient and effective network
architecture named MSTNet for action recognition, and achieve state-of-the-art
results on multiple benchmarks.
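Since the block is described quite concretely, here is a minimal sketch of a mixed temporal convolution: the channels are split into groups and each group is processed by a depthwise 1D convolution with a different kernel size. The particular kernel sizes and the equal channel split are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MixTConv1d(nn.Module):
    def __init__(self, channels, kernel_sizes=(1, 3, 5, 7)):
        super().__init__()
        assert channels % len(kernel_sizes) == 0
        self.group = channels // len(kernel_sizes)
        # One depthwise 1D convolution per kernel size, each over its own slice of
        # the channels, so the cost stays close to that of a single 3-tap convolution.
        self.branches = nn.ModuleList(
            nn.Conv1d(self.group, self.group, k, padding=k // 2, groups=self.group)
            for k in kernel_sizes
        )

    def forward(self, x):
        # x: (B, C, T) clip features laid out along the temporal axis T.
        chunks = torch.split(x, self.group, dim=1)
        return torch.cat([conv(c) for conv, c in zip(self.branches, chunks)], dim=1)

# Example: 64-channel features over 8 frames keep their shape.
# y = MixTConv1d(64)(torch.randn(2, 64, 8))   # y.shape == (2, 64, 8)
```

Mixing kernel sizes lets different channel groups specialize in short-term and long-term temporal patterns without increasing the overall cost much.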
GSTO: Gated Scale-Transfer Operation for Multi-Scale Feature Learning in Pixel Labeling
Existing CNN-based methods for pixel labeling heavily depend on multi-scale
features to meet the requirements of both semantic comprehension and detail
preservation. State-of-the-art pixel labeling neural networks widely exploit
conventional scale-transfer operations, i.e., up-sampling and down-sampling to
learn multi-scale features. In this work, we find that these operations lead to
scale-confused features and suboptimal performance because they are
spatially invariant and directly pass all feature information across scales
without spatial selection. To address this issue, we propose the Gated
Scale-Transfer Operation (GSTO) to properly transfer spatially filtered features
to another scale. Specifically, GSTO can work either with or without extra
supervision. The unsupervised GSTO is learned from the feature itself, while the
supervised one is guided by a supervised probability matrix. Both forms of
GSTO are lightweight and plug-and-play, which can be flexibly integrated into
networks or modules for learning better multi-scale features. In particular, by
plugging GSTO into HRNet, we get a more powerful backbone (namely GSTO-HRNet)
for pixel labeling, and it achieves new state-of-the-art results on the COCO
benchmark for human pose estimation and other benchmarks for semantic
segmentation including Cityscapes, LIP and Pascal Context, with negligible
extra computational cost. Moreover, experimental results demonstrate that GSTO
can also significantly boost the performance of multi-scale feature aggregation
modules like PPM and ASPP. Code will be made available at
https://github.com/VDIGPKU/GSTO
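A minimal sketch of the unsupervised form of a gated scale-transfer as we read the description above (not the released code): a 1x1 convolution predicts a per-pixel gate from the feature itself, the gate is applied, and only then is the feature resampled to the target scale.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedScaleTransfer(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # 1x1 convolution predicting a per-pixel gate in [0, 1] from the feature.
        self.gate = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, x, scale_factor=2.0):
        gated = x * self.gate(x)   # spatially select what is allowed to cross scales
        return F.interpolate(gated, scale_factor=scale_factor,
                             mode='bilinear', align_corners=False)

# Example: transfer a (1, 64, 32, 32) feature map up to 64 x 64.
# up = GatedScaleTransfer(64)(torch.randn(1, 64, 32, 32))
```

The gate is what distinguishes this from a plain up-sampling: feature information is filtered spatially before it crosses scales instead of being copied everywhere.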
CubemapSLAM: A Piecewise-Pinhole Monocular Fisheye SLAM System
We present a real-time feature-based SLAM (Simultaneous Localization and
Mapping) system for fisheye cameras, which feature a large field of view (FoV).
Large FoV cameras are beneficial for large-scale outdoor SLAM applications,
because they increase visual overlap between consecutive frames and capture
more pixels belonging to the static parts of the environment. However, current
feature-based SLAM systems such as PTAM and ORB-SLAM limit their camera model
to the pinhole model only. To fill this gap, we propose a novel SLAM system
with the cubemap model that utilizes the full FoV without introducing
distortion from the fisheye lens, which greatly benefits the feature matching
pipeline. In the initialization and point triangulation stages, we adopt a
unified vector-based representation to efficiently handle matches across
multiple faces, and based on this representation we propose and analyze a novel
inlier checking metric. In the optimization stage, we design and test a novel
multi-pinhole reprojection error metric that outperforms other metrics by a
large margin. We evaluate our system comprehensively on a public dataset as
well as a self-collected dataset that contains real-world challenging
sequences. The results suggest that our system is more robust and accurate than
other feature-based fisheye SLAM approaches. The CubemapSLAM system has been
released into the public domain.
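To illustrate the piecewise-pinhole idea, the sketch below assigns a 3D point to the cube face whose axis dominates its bearing and projects it with an ordinary pinhole model on that face; the face indexing and sign conventions are illustrative, not those of CubemapSLAM.

```python
import numpy as np

def project_to_cubemap(p, face_size=512):
    """p: 3D point in the camera frame; returns (face, u, v) pixel coordinates on
    one of six virtual pinhole faces (face ordering and signs are illustrative)."""
    x, y, z = p
    ax, ay, az = abs(x), abs(y), abs(z)
    f = c = face_size / 2.0                  # per-face focal length and principal point
    if az >= ax and az >= ay:                # front / back faces
        face, a, b, depth = (0 if z > 0 else 1), (x if z > 0 else -x), y, az
    elif ax >= ay:                           # right / left faces
        face, a, b, depth = (2 if x > 0 else 3), (-z if x > 0 else z), y, ax
    else:                                    # top / bottom faces
        face, a, b, depth = (4 if y > 0 else 5), x, (-z if y > 0 else z), ay
    # Ordinary pinhole projection on the selected face: no fisheye distortion terms.
    return face, f * a / depth + c, f * b / depth + c
```

Because every face behaves like a standard pinhole camera, conventional feature matching and reprojection-error machinery can be reused while the full fisheye FoV is preserved.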